6 research outputs found
WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models
Large pretrained language models (LMs) have become the central building block
of many NLP applications. Training these models requires ever more
computational resources and most of the existing models are trained on English
text only. It is exceedingly expensive to train these models in other
languages. To alleviate this problem, we introduce a novel method -- called
WECHSEL -- to efficiently and effectively transfer pretrained LMs to new
languages. WECHSEL can be applied to any model which uses subword-based
tokenization and learns an embedding for each subword. The tokenizer of the
source model (in English) is replaced with a tokenizer in the target language
and token embeddings are initialized such that they are semantically similar to
the English tokens by utilizing multilingual static word embeddings covering
English and the target language. We use WECHSEL to transfer the English RoBERTa
and GPT-2 models to four languages (French, German, Chinese and Swahili). We
also study the benefits of our method on very low-resource languages. WECHSEL
improves over previously proposed methods for cross-lingual parameter transfer and
outperforms models of comparable size trained from scratch with up to 64x less
training effort. Our method makes training large language models for new
languages more accessible and less damaging to the environment. We make our
code and models publicly available.
Comment: NAACL 202
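The embedding-initialization idea above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the released WECHSEL implementation: the function name `wechsel_init`, the top-k neighbor selection, and the softmax temperature are choices made here for clarity. The core idea from the abstract is that each target-language token embedding starts as a similarity-weighted combination of source-model embeddings, with similarities computed in a shared multilingual static word-embedding space.

```python
import numpy as np

def wechsel_init(src_emb, src_static, tgt_static, temperature=0.05, k=2):
    """Sketch of WECHSEL-style embedding initialization.

    src_emb:    (S, D) pretrained source-model token embeddings
    src_static: (S, d) static embeddings of the source tokens (aligned space)
    tgt_static: (T, d) static embeddings of the target tokens (same space)
    Returns a (T, D) initialization for the target model's embeddings.
    """
    # cosine similarity between every target token and every source token
    src_n = src_static / np.linalg.norm(src_static, axis=1, keepdims=True)
    tgt_n = tgt_static / np.linalg.norm(tgt_static, axis=1, keepdims=True)
    sim = tgt_n @ src_n.T                         # (T, S)
    idx = np.argsort(sim, axis=1)[:, -k:]         # k nearest source tokens
    tgt_emb = np.zeros((tgt_static.shape[0], src_emb.shape[1]))
    for t in range(tgt_static.shape[0]):
        w = np.exp(sim[t, idx[t]] / temperature)  # sharpen similarities
        w /= w.sum()
        tgt_emb[t] = w @ src_emb[idx[t]]          # convex combination
    return tgt_emb
```

A target token whose static embedding closely matches an English token's thus inherits (approximately) that token's pretrained embedding, so the transferred model starts semantically close to the source model.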
Semantic HELM: A Human-Readable Memory for Reinforcement Learning
Reinforcement learning agents deployed in the real world often have to cope
with partially observable environments. Therefore, most agents employ memory
mechanisms to approximate the state of the environment. Recently, there have
been impressive success stories in mastering partially observable environments,
mostly in the realm of computer games like Dota 2, StarCraft II, or Minecraft.
However, existing methods lack interpretability in the sense that it is not
comprehensible for humans what the agent stores in its memory. In this regard,
we propose a novel memory mechanism that represents past events in human
language. Our method uses CLIP to associate visual inputs with language tokens.
Then we feed these tokens to a pretrained language model that serves the agent
as memory and provides it with a coherent and human-readable representation of
the past. We train our memory mechanism on a set of partially observable
environments and find that it excels on tasks that require a memory component,
while mostly attaining performance on-par with strong baselines on tasks that
do not. On a challenging continuous recognition task, where memorizing the past
is crucial, our memory mechanism converges two orders of magnitude faster than
prior methods. Since our memory mechanism is human-readable, we can peek at an
agent's memory and check whether crucial pieces of information have been
stored. This significantly enhances troubleshooting and paves the way toward
more interpretable agents.
Comment: To appear at NeurIPS 2023, 10 pages (+ references and appendix),
Code: https://github.com/ml-jku/hel
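The token-retrieval step described above can be illustrated with a small stand-in. This is a hedged sketch, not the paper's code: `nearest_tokens`, the toy vocabulary, and the use of plain arrays in place of real CLIP embeddings are assumptions made here. The mechanism it mirrors is mapping a visual embedding to its nearest vocabulary tokens by cosine similarity, so that the retrieved token strings form a human-readable memory.

```python
import numpy as np

def nearest_tokens(image_emb, token_embs, vocab, top_k=3):
    """Return the top_k vocabulary tokens whose embeddings are most
    similar (cosine) to a visual embedding -- the CLIP-style association
    that turns an observation into readable memory tokens."""
    q = image_emb / np.linalg.norm(image_emb)
    t = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    scores = t @ q                        # cosine similarity per token
    best = np.argsort(scores)[::-1][:top_k]
    return [vocab[i] for i in best]
```

The retrieved token strings for each timestep would then be concatenated and fed to the pretrained language model acting as the agent's memory.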
Learning to Modulate pre-trained Models in RL
Reinforcement Learning (RL) has been successful in various domains like
robotics, game playing, and simulation. While RL agents have shown impressive
capabilities on their specific tasks, they adapt poorly to new tasks.
In supervised learning, this adaptation problem is addressed by large-scale
pre-training followed by fine-tuning to new down-stream tasks. Recently,
pre-training on multiple tasks has been gaining traction in RL. However,
fine-tuning a pre-trained model often suffers from catastrophic forgetting.
That is, the performance on the pre-training tasks deteriorates when
fine-tuning on new tasks. To investigate the catastrophic forgetting
phenomenon, we first jointly pre-train a model on datasets from two benchmark
suites, namely Meta-World and DMControl. Then, we evaluate and compare a
variety of fine-tuning methods prevalent in natural language processing, both
in terms of performance on new tasks, and how well performance on pre-training
tasks is retained. Our study shows that with most fine-tuning approaches, the
performance on pre-training tasks deteriorates significantly. Therefore, we
propose a novel method, Learning-to-Modulate (L2M), that avoids the degradation
of learned skills by modulating the information flow of the frozen pre-trained
model via a learnable modulation pool. Our method achieves state-of-the-art
performance on the Continual-World benchmark, while retaining performance on
the pre-training tasks. Finally, to aid future research in this area, we
release a dataset encompassing 50 Meta-World and 16 DMControl tasks.
Comment: 10 pages (+ references and appendix), Code:
https://github.com/ml-jku/L2
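The modulation idea can be sketched with a toy layer. This is a simplified illustration under assumptions, not the L2M implementation: the class name, the per-task gain vectors, and the elementwise rescaling are stand-ins. What it captures is the abstract's core point: the pre-trained weights stay frozen, and only a small learnable modulation (selected per task from a pool) changes the information flow, so fine-tuning a new task cannot overwrite previously learned skills.

```python
import numpy as np

class ModulatedLayer:
    """Toy learning-to-modulate layer: the frozen weight matrix W is
    never updated; a small learnable gain vector per task, drawn from a
    modulation pool, rescales the frozen layer's output instead."""

    def __init__(self, W, n_tasks):
        self.W = W                                   # frozen, never trained
        self.pool = np.ones((n_tasks, W.shape[0]))   # learnable per-task gains

    def forward(self, x, task_id):
        return self.pool[task_id] * (self.W @ x)     # modulate frozen output
```

During continual training only `pool` would receive gradients, so evaluating an earlier `task_id` reproduces the earlier behavior exactly.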
Reactive Exploration to Cope with Non-Stationarity in Lifelong Reinforcement Learning
In lifelong learning, an agent learns throughout its entire life without
resets, in a constantly changing environment, as we humans do. Consequently,
lifelong learning comes with a plethora of research problems such as continual
domain shifts, which result in non-stationary rewards and environment dynamics.
These non-stationarities are difficult to detect and cope with due to their
continuous nature. Therefore, exploration strategies and learning methods are
required that are capable of tracking the steady domain shifts, and adapting to
them. We propose Reactive Exploration to track and react to continual domain
shifts in lifelong reinforcement learning, and to update the policy
correspondingly. To this end, we conduct experiments in order to investigate
different exploration strategies. We empirically show that representatives of
the policy-gradient family are better suited for lifelong learning, as they
adapt more quickly to distribution shifts than Q-learning. Consequently,
policy-gradient methods profit most from Reactive Exploration and show good
results in lifelong learning with continual domain shifts. Our code is
available at: https://github.com/ml-jku/reactive-exploration.
Comment: CoLLAs 202
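One way to make "tracking and reacting to domain shifts" concrete is a prediction-error-driven exploration bonus. This sketch is an assumption on my part, not the paper's exact recipe: it keeps a crude online linear forward model and uses its prediction error as an intrinsic reward. While dynamics are stable the error (and thus the exploration bonus) decays; when the dynamics shift, the error spikes and exploration is re-triggered.

```python
import numpy as np

def intrinsic_rewards(obs, next_obs, lr=0.5):
    """Toy shift-tracking signal: fit a linear forward model A online
    (one gradient-like step per transition) and emit the prediction
    error norm as an intrinsic exploration bonus."""
    d = obs.shape[1]
    A = np.zeros((d, d))                  # crude forward model o -> o'
    bonuses = []
    for o, o2 in zip(obs, next_obs):
        err = o2 - A @ o                  # prediction error
        bonuses.append(np.linalg.norm(err))
        A += lr * np.outer(err, o)        # online least-squares step
    return np.array(bonuses)
```

A non-stationary environment would be detected purely from this signal, without any explicit change-point annotation.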
Improving Generalization of Deep Convolutional Neural Networks for Acoustic Scene Classification
In recent years, deep learning has become one of the most popular machine learning techniques for a vast variety of complex problems. An example of such a task is to mirror the human auditory system by classifying audio recordings according to the location they were recorded in. This work focuses mainly on the Acoustic Scene Classification task proposed by the IEEE DCASE Challenge. The dataset for Acoustic Scene Classification consists of recordings from distinct recording locations, and the aim of the challenge is to classify an unseen test set of recordings. In the 2016 challenge, the training and test sets did not differ significantly. In the 2017 challenge, however, the test set originated from a different distribution, implying a strong need for generalization.
In the course of this work, the initial implementation, a Deep Convolutional Neural Network for the DCASE 2016 challenge submission (done in Lasagne), was re-implemented in Keras. An extension of the ADAM optimizer (AMSGrad) was investigated for improvement in generalization. Other submissions to the DCASE 2017 challenge suggest that different types of spectrograms might be key to better generalization, so experiments utilizing different kinds of spectrograms were conducted. Furthermore, different interpolation algorithms were used for data augmentation, with some of them yielding significant improvements in classification accuracy and generalization. For different spectrogram dimensions, slight adjustments in the network architecture also resulted in a performance gain. To better understand what different models "see" and what they focus on, their filters and activations were visualized and compared for differences.
Finally, the adjustments which led to better generalization on the dataset of the DCASE 2016 challenge were tested on the dataset of the DCASE 2017 challenge, leading to an improvement over all submissions to the DCASE 2017 challenge from the Institute of Computational Perception.
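The interpolation-based augmentation mentioned above can be sketched as resampling a spectrogram's time axis. This is a minimal stand-in, not the thesis code: the function name and the restriction to nearest/linear resampling are assumptions; a real pipeline would offer more interpolation kernels and resize both axes.

```python
import numpy as np

def stretch_spectrogram(spec, new_len, kind="linear"):
    """Time-stretch a (freq, time) spectrogram by resampling its time
    axis -- a simple form of interpolation-based data augmentation."""
    n_freq, n_time = spec.shape
    old_t = np.arange(n_time)
    new_t = np.linspace(0, n_time - 1, new_len)
    if kind == "nearest":
        return spec[:, np.rint(new_t).astype(int)]   # pick nearest frames
    # linear interpolation per frequency bin
    return np.vstack([np.interp(new_t, old_t, row) for row in spec])
```

Varying `new_len` and `kind` yields slightly different training examples from the same recording, which is the source of the augmentation effect.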
History Compression via Language Models in Reinforcement Learning
In a partially observable Markov decision process (POMDP), an agent typically
uses a representation of the past to approximate the underlying MDP. We propose
to utilize a frozen Pretrained Language Transformer (PLT) for history
representation and compression to improve sample efficiency. To avoid training
of the Transformer, we introduce FrozenHopfield, which automatically associates
observations with pretrained token embeddings. To form these associations, a
modern Hopfield network stores these token embeddings, which are retrieved by
queries that are obtained by a random but fixed projection of observations. Our
new method, HELM, enables actor-critic network architectures that contain a
pretrained language Transformer for history representation as a memory module.
Since a representation of the past need not be learned, HELM is much more
sample-efficient than competitors. On Minigrid and Procgen environments, HELM
achieves new state-of-the-art results. Our code is available at
https://github.com/ml-jku/helm.
Comment: ICML 202
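The FrozenHopfield association described above can be sketched compactly. This is a hedged illustration of the mechanism, not the HELM code: the function name and dimensions are assumptions. It follows the abstract's recipe: a random but fixed projection maps an observation into token-embedding space to form a query, and a modern-Hopfield retrieval (softmax attention over the stored token embeddings) returns the representation handed to the frozen language Transformer. Nothing here is trained.

```python
import numpy as np

def frozen_hopfield(obs, token_embs, beta=1.0, seed=0):
    """Associate an observation with pretrained token embeddings.

    obs:        (d_obs,) observation vector
    token_embs: (V, d_tok) stored pretrained token embeddings
    beta:       inverse temperature of the Hopfield retrieval
    """
    rng = np.random.default_rng(seed)
    d_obs, d_tok = obs.shape[0], token_embs.shape[1]
    P = rng.standard_normal((d_tok, d_obs)) / np.sqrt(d_obs)  # fixed, random
    q = P @ obs                                   # query in token space
    logits = beta * token_embs @ q
    attn = np.exp(logits - logits.max())          # softmax retrieval
    attn /= attn.sum()
    return attn @ token_embs                      # convex combo of embeddings
```

The retrieved vector lies in the convex hull of the token embeddings, so it is always a valid input for the frozen Transformer; larger `beta` snaps it closer to a single stored token.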